-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add regex_program class for use with all regex APIs #11927
Add regex_program class for use with all regex APIs #11927
Conversation
Codecov ReportBase: 87.47% // Head: 88.07% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #11927 +/- ##
================================================
+ Coverage 87.47% 88.07% +0.59%
================================================
Files 133 135 +2
Lines 21826 22025 +199
================================================
+ Hits 19093 19398 +305
+ Misses 2733 2627 -106
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the refactor. I mostly have questions about the current encapsulation of the program impl class.
Also, should we be concerned about overloading the "prog" name given that we already use this on device for the reprog_device
class? Are the two similar enough that it makes sense to have similar names? Should we consider an alternative?
@@ -24,6 +24,9 @@ | |||
|
|||
namespace cudf { | |||
namespace strings { | |||
|
|||
struct regex_program; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to include a note in the developer guide about our preferred uses of forward declarations of includes (if you can come up with a pithy summary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Encapsulating information in the regex_program
will help keep the public APIs constrained while adding new features that can be controlled via the program.
@gpucibot merge |
FYI, surprisingly this PR has broken a large number of RAPIDS Accelerator tests related to regex. Tests pass before this change and fail afterwards. In at least one of the tests, the nature of the failure appears to be solely related to the data column. There are null characters (or uninitialized memory?) appearing in blocks of rows periodically in the output. I'll try to narrow this down to a repro case in C++. |
I just noticed this too. I'm working on a fix. |
Fixes error in `working_memory_size()` member function passing the parameters incorrectly. This was introduce in #11927 and found in the nightly compute-sanitizer check. https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/branches/job/cudf-gpu-build-branch-22.12/19/CUDA=11.5/testReport/junit/cudamemcheck/STRINGS_TEST/StringsContainsTests_ContainsTest/ Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Jason Lowe (https://github.com/jlowe) - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #12119
Deprecates the libcudf regex APIs that do not accept `cudf::strings::regex_program` objects. The libcudf regex functions were converted to accept `regex_program` objects instead of a pattern string and regex-flags directly in PR #11927 Most of this is removing calls to the deprecated functions in the gtests. The declarations in the headers had to be reordered to make the reference links work in the doxygen output. These deprecated functions will be removed in a future release. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #12891
Description
Adds a new
regex_program
class to encapsulate a regex pattern and parameters used for executing regex calls on strings columns in libcudf. This provides a single object to hold the regex settings rather than adding or updating parameters to every call. Given a pattern (and other settings), it will compile and validate the pattern and build the set of instructions/commands needed to execute the regex on a strings column. Converting the pattern is done in CPU code. The object contains no state data and can be reused on the same API or other similar calls as appropriate (per the settings).The object can also be queried to help with resource allocation/expectations.
The main files to review are the new
regex_program*
source files plus the corresponding changes inregexec.cpp
(renamed from .cu). The remainder are simply side-effects and have common patterns to use the new object.No function or behavior has changed but rather an new interface has been added over existing function but additional tests have been added to exercise through the companion APIs.
Currently, all regex APIs are duplicated -- the original API plus a new one accepting a
regex_progam
object. Once accepted we may consider deprecating the non-object APIs and then removing them in a future release.This will help with changes needed for #10852
Checklist